Abstract

We investigate the task of building open domain, conversational
dialogue systems based on large dialogue corpora using generative
models. Generative models produce system responses that are
autonomously generated word-by-word, opening up the possibility
for realistic, flexible interactions.

In support of this goal, we extend the recently proposed
hierarchical recurrent encoder-decoder neural network to the
dialogue domain, and demonstrate that this model is competitive
with state-of-the-art neural language models and back-off n-gram
models. We investigate the limitations of this and similar
approaches, and show how its performance can be improved by
bootstrapping the learning from a larger question-answer pair
corpus and from pretrained word embeddings.

Introduction 

Dialogue systems, also known as interactive conversational
agents, virtual agents and sometimes chatterbots, are used
in a wide set of applications ranging from technical support
services to language learning tools and entertainment
(Young et al. 2013; Shawar and Atwell 2007). Dialogue systems
can be divided into goal-driven systems, such as technical
support services, and non-goal-driven systems, such as
language learning tools or computer game characters. Our
current work focuses on the second case, due to the availability
of large corpora of this type, though the model may
eventually prove useful for goal-driven systems also.

Perhaps the most successful approach to goal-driven systems
has been to view the dialogue problem as a partially
observable Markov decision process (POMDP) (Young et al.
2013). Unfortunately, most deployed dialogue systems use
hand-crafted features for the state and action space
representations, and require either a large annotated task-specific
corpus or a horde of human subjects willing to interact with
the unfinished system. This not only makes it expensive and
time-consuming to deploy a real dialogue system, but also
limits its usage to a narrow domain. Recent work has tried
to push goal-driven systems towards learning with few
examples using constraints on the POMDP (Gasic et al. 2013)
as well as learning the observed features themselves with
neural network models (Henderson, Thomson, and Young
2014), yet such approaches still require either hand-crafted
features or large corpora of annotated task-specific simulated
conversations.

On the other end of the spectrum are the non-goal-driven
systems (Ritter, Cherry, and Dolan 2011; Banchs
and Li 2012; Ameixa et al. 2014). Most recently Sordoni
et al. (2015b) and Shang et al. (2015) have drawn
inspiration from the use of neural networks in natural
language modeling and machine translation tasks (Cho et al.
2014). There are several motivations for developing
non-goal-driven systems. First, they may be deployed directly
for tasks which do not naturally exhibit a directly measurable
goal (e.g. language learning) or simply for entertainment.
Second, if they are trained on corpora related to the
task of a goal-driven dialogue system (e.g. corpora which
cover conversations on similar topics) then these models
can be used to train a user simulator, which can then train
the POMDP models discussed earlier (Young et al. 2013;
Pietquin and Hastie 2013). This would alleviate the expensive
and time-consuming task of constructing a large-scale
task-specific dialogue corpus. In addition to this, the
features extracted from the non-goal-driven systems may be
used to expand the state space representation of POMDP
models (Singh et al. 2002). This can help generalization to
dialogues outside the annotated task-specific corpora.

Our contribution is in the direction of end-to-end trainable,
non-goal-driven systems based on generative probabilistic
models. We define the generative dialogue problem
as modeling the utterances and interactive structure of the
dialogue. As such, we view our model as a cognitive system,
which has to carry out natural language understanding,
reasoning, decision making and natural language generation
in order to replicate or emulate the behavior of the
agents in the training corpus. Our approach differs from
previous work on learning dialogue systems through interaction
with humans (Young et al. 2013; Gasic et al. 2013;
Cantrell et al. 2012; Mohan and Laird 2014), because it
learns off-line through examples of human-human dialogues
and aims to emulate the dialogues in the training corpus
instead of maximizing a task-specific objective function.
Contrary to explanation-based learning (Mohan and Laird 2014)
and rule-based inference systems (Langley et al. 2014), our
model does not require a predefined state or action space
representation. These representations are instead learned
directly from the corpus examples together with inference
mechanisms, which map dialogue utterances to dialogue
states, and action generation mechanisms, which map dialogue
states to dialogue acts and stochastically to response
utterances. We believe that training such a model end-to-end
to minimize a single objective function, and with minimum
reliance on hand-crafted features, will yield superior
performance in the long run. Furthermore, we focus on models
which can be trained efficiently on large datasets and which
are able to maintain state over long conversations.

We experiment with the well-established recurrent neural
networks (RNN) and n-gram models. In particular, we adopt
the hierarchical recurrent encoder-decoder (HRED) (Sordoni
et al. 2015a) and demonstrate that it is competitive with
other models in the literature. We extend the model
architecture to better suit the dialogue task. We show that
performance can be substantially improved by bootstrapping from
pretrained word embeddings and by pretraining the model
on a larger question-answer pair (Q-A) corpus. To carry out
experiments, we introduce the MovieTriples dialogue dataset
based on movie scripts.


Related Work 

Modeling conversations on micro-blogging websites with
generative probabilistic models was first proposed by Ritter
et al. (2011). They view the response generation problem
as a translation problem, where a post needs to be translated
into a response. Generating responses was found to be
considerably more difficult than translating between languages,
likely due to the wide range of plausible responses and lack
of phrase alignment between the post and the response.

Later, Shang et al. (2015) proposed to use the recurrent
neural network framework for generating responses on
micro-blogging websites. This was followed up by Sordoni
et al. (2015b), who extended the framework from status-reply
pairs to triples of three consecutive utterances.

To the best of our knowledge, Banchs et al. (2012) were
the first to suggest using movie scripts to build dialogue
systems. Conditioned on one or more utterances, their model
searches a database of movie scripts and retrieves an
appropriate response. This was later followed up by Ameixa et
al. (2014), who demonstrated that movie subtitles could be
used to provide responses to out-of-domain questions using
an information retrieval system.


Models 

We consider a dialogue as a sequence of $M$ utterances
$D = \{U_1, \ldots, U_M\}$ involving two interlocutors. Each
$U_m$ contains a sequence of $N_m$ tokens, i.e.
$U_m = \{w_{m,1}, \ldots, w_{m,N_m}\}$, where $w_{m,n}$ is a random
variable taking values in the vocabulary $V$ and representing the
token at position $n$ in utterance $m$. The tokens represent both
words and speech acts, e.g. pause and end of turn tokens. A
generative model of dialogue parameterizes a probability
distribution $P$, governed by parameters $\theta$, over the set of all
possible dialogues of arbitrary lengths. The probability of a
dialogue $D$ can be decomposed:

$$P_\theta(U_1, \ldots, U_M) = \prod_{m=1}^{M} P_\theta(U_m \mid U_{<m}) = \prod_{m=1}^{M} \prod_{n=1}^{N_m} P_\theta(w_{m,n} \mid w_{m,<n}, U_{<m}), \tag{1}$$

where $U_{<m} = \{U_1, \ldots, U_{m-1}\}$ and
$w_{m,<n} = \{w_{m,1}, \ldots, w_{m,n-1}\}$,
i.e. the tokens preceding $n$ in the utterance $U_m$. The task is
analogous to language modeling, with the critical difference
that speech acts are included as separate tokens. Sampling
from the model can be performed as in standard language
modeling: sampling one word at a time from the conditional
distribution $P_\theta(w_{m,n} \mid w_{m,<n}, U_{<m})$ conditioned on the
previously sampled words.
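
As a minimal sketch of the token-level factorization described above, the log-probability of a dialogue can be accumulated one conditional at a time. The function `dialogue_log_prob`, the callable `cond_prob` and the uniform toy model are hypothetical names introduced only for illustration; any model that returns $P_\theta(w_{m,n} \mid w_{m,<n}, U_{<m})$ could be plugged in.

```python
import math

def dialogue_log_prob(dialogue, cond_prob):
    """Sum log P(w_{m,n} | w_{m,<n}, U_{<m}) over every token of every utterance."""
    total = 0.0
    for m, utterance in enumerate(dialogue):
        history = dialogue[:m]          # U_{<m}: all earlier utterances
        for n, token in enumerate(utterance):
            prefix = utterance[:n]      # w_{m,<n}: earlier tokens of U_m
            total += math.log(cond_prob(token, prefix, history))
    return total

# Hypothetical stand-in model: uniform over a toy vocabulary.
VOCAB = ["hi", "there", "bye", "</s>"]

def uniform_model(token, prefix, history):
    return 1.0 / len(VOCAB)

toy_dialogue = [["hi", "there", "</s>"], ["bye", "</s>"]]
log_p = dialogue_log_prob(toy_dialogue, uniform_model)
```

With the uniform stand-in, every one of the five tokens contributes $\log(1/4)$, which makes the chain-rule bookkeeping easy to check by hand.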

Using standard n-grams to compute joint probabilities
over dialogues, e.g. computing probability tables for each
token given the $n$ preceding tokens, suffers from the curse of
dimensionality and is intractable for any realistic vocabulary
size. To overcome this, Bengio et al. (2003) proposed a
distributed (dense) vector representation of words, called word
embeddings, which parameterizes
$P_\theta(w_{m,n} \mid w_{m,<n}, U_{<m})$
as a smooth function using a neural network. By means
of such distributed representations, the recurrent neural
network (RNN) based language model (Mikolov et al. 2010)
has pushed state-of-the-art performance by learning long
n-gram contexts while avoiding data sparsity issues. Overall,
RNNs have performed well on a variety of NLP tasks such
as machine translation (Cho et al. 2014; Sutskever, Vinyals,
and Le 2014; Bahdanau, Cho, and Bengio 2015) and
information retrieval (Sordoni et al. 2015a).


Recurrent Neural Network 

A recurrent neural network (RNN) models an input sequence
of tokens $\{w_1, \ldots, w_N\}$ using the recurrence:

$$h_n = f(h_{n-1}, w_n), \tag{2}$$

where $h_n \in \mathbb{R}^{d_h}$ is called a recurrent, or hidden, state and
acts as a vector representation of the tokens seen up to
position $n$. In particular, the last state $h_N$ may be viewed as an
order-sensitive compact summary of all the tokens. In language
modeling tasks, the context information encoded in
$h_n$ is used to predict the next token in the sentence:

$$P_\theta(w_{n+1} = v \mid w_{\le n}) = \frac{\exp(g(h_n, v))}{\sum_{v'} \exp(g(h_n, v'))}.$$

The functions $f$ and $g$ are typically defined as:

$$f(h_{n-1}, w_n) = \tanh(H h_{n-1} + I_{w_n}), \tag{3}$$

$$g(h_n, v) = O_v^{T} h_n. \tag{4}$$

Figure 1: The computational graph of the HRED architecture for a dialogue composed of three turns. Each utterance is
encoded into a dense vector and then mapped into the dialogue context, which is used to decode (generate) the tokens in the
next utterance. The encoder RNN encodes the tokens appearing within the utterance, and the context RNN encodes the temporal
structure of the utterances appearing so far in the dialogue, allowing information and gradients to flow over longer time spans.
The decoder predicts one token at a time using an RNN. Adapted from Sordoni et al. (2015a).

The matrix $I \in \mathbb{R}^{d_h \times |V|}$ contains the input word
embeddings, i.e. each column $I_j$ is a vector corresponding to token
$j$ in the vocabulary $V$. Due to the size of the model
vocabulary $|V|$, it is common to approximate the $I$ matrix with a
low-rank decomposition, i.e. $I = XE$, where
$X \in \mathbb{R}^{d_h \times d_e}$ and $E \in \mathbb{R}^{d_e \times |V|}$,
and $d_e < d_h$. This approach also has
the advantage that the embedding matrix $E$ may separately
be bootstrapped (e.g. learned) from larger corpora.
Analogously, the matrix $O \in \mathbb{R}^{d_h \times |V|}$ represents the
output word embeddings, where each possible next token is projected
into another dense vector and compared to the hidden state
$h_n$. The probability of seeing token $v$ at position $n + 1$
increases if its corresponding embedding vector $O_v$ is "near"
the context vector $h_n$. The parameter $H$ is called a recurrent
parameter, because it links $h_{n-1}$ to $h_n$. All parameters are
learned by maximizing the log-likelihood of the parameters
on a training set using stochastic gradient descent.
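
The recurrence and output distribution of this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the dimensions, the random Gaussian initialization of every matrix, and the names `rnn_step` and `next_token_probs` are hypothetical, and the matrices are stored column by column as Python lists.

```python
import math
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def rnn_step(H, I_cols, h_prev, token_id):
    """Eq. (3): h_n = tanh(H h_{n-1} + I_{w_n})."""
    pre = matvec(H, h_prev)
    emb = I_cols[token_id]              # column I_j for the current token
    return [math.tanh(p + e) for p, e in zip(pre, emb)]

def next_token_probs(O_cols, h):
    """Eq. (4) scores g(h_n, v) = O_v^T h_n, normalized with a softmax."""
    scores = [sum(o * x for o, x in zip(O_v, h)) for O_v in O_cols]
    top = max(scores)                   # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(0)
d_h, vocab_size = 4, 6                  # toy sizes, chosen for illustration
gauss = lambda: random.gauss(0.0, 0.01)
H = [[gauss() for _ in range(d_h)] for _ in range(d_h)]
I_cols = [[gauss() for _ in range(d_h)] for _ in range(vocab_size)]
O_cols = [[gauss() for _ in range(d_h)] for _ in range(vocab_size)]

h = [0.0] * d_h
for token_id in [2, 5, 1]:              # a hypothetical three-token input
    h = rnn_step(H, I_cols, h, token_id)
probs = next_token_probs(O_cols, h)     # distribution over the next token
```

The sketch omits training; it only shows how the hidden state is updated per token and how the output embeddings turn that state into a normalized distribution over the vocabulary.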

Hierarchical Recurrent Encoder-Decoder 

Our work extends the hierarchical recurrent encoder-decoder
architecture (HRED) proposed by Sordoni et
al. (2015a) for web query suggestion. In the original
framework, HRED predicts the next web query given the queries
already submitted by the user. The history of past submitted
queries is considered as a sequence at two levels: a sequence
of words for each web query and a sequence of queries.
HRED models this hierarchy of sequences with two RNNs:
one at the word level and one at the query level. We make
a similar assumption, namely, that a dialogue can be seen
as a sequence of utterances which, in turn, are sequences of
tokens. A representation of HRED is given in Figure 1.

In dialogue, the encoder RNN maps each utterance to an
utterance vector. The utterance vector is the hidden state
obtained after the last token of the utterance has been
processed. The higher-level context RNN keeps track of past
utterances by iteratively processing each utterance vector.
After processing utterance $U_m$, the hidden state of the context
RNN represents a summary of the dialogue up to and including
turn $m$, which is used to predict the next utterance $U_{m+1}$.
This hidden state can be interpreted as the continuous-valued
state of the dialogue system. The next utterance prediction is
performed by means of a decoder RNN, which takes the hidden
state of the context RNN and produces a probability
distribution over the tokens in the next utterance. The decoder
RNN is similar to the RNN language model (Mikolov et al.
2010), but with the important difference that the prediction
is conditioned on the hidden state of the context RNN. It can
be interpreted as the response generation module of the
dialogue system. The encoder, context and decoder RNNs all
make use of the GRU hidden unit (Cho et al. 2014).
Everywhere else we use the hyperbolic tangent as activation
function. It is also possible to use the maxout activation
function between the hidden state and the projected word
embeddings of the decoder RNN (Goodfellow et al. 2013). The
same encoder RNN and decoder RNN parameters are used
for every utterance in a dialogue. This helps the model
generalize across utterances. Further details of the architecture
are described by Sordoni et al. (2015a).

For modeling dialogues, we expect the HRED model to be
superior to the standard RNN model for two reasons. First,
because the context RNN allows the model to represent a
form of common ground between speakers, e.g. to represent
topics and concepts shared between the speakers using a
distributed vector representation, which we hypothesize to be
important for building an effective dialogue system (Clark
and Brennan 1991). Second, because the number of
computational steps between utterances is reduced. This makes
the objective function more stable w.r.t. the model
parameters, and helps propagate the training signal for first-order
optimization methods (Sordoni et al. 2015a).
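
The two-level structure can be illustrated with a small sketch of the encoder and context RNNs. It is deliberately simplified: plain tanh recurrences stand in for the GRU units used in the paper, the decoder is omitted, and all sizes, weights and function names (`encode_utterance`, `dialogue_context`) are hypothetical.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def tanh_step(W_h, W_x, h, x):
    """One plain tanh recurrence (a stand-in for a GRU update)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_h, h), matvec(W_x, x))]

d_emb, d_enc, d_ctx, vocab_size = 3, 4, 5, 7       # toy sizes
E = rand_matrix(vocab_size, d_emb)                 # one embedding per token id
enc_h, enc_x = rand_matrix(d_enc, d_enc), rand_matrix(d_enc, d_emb)
ctx_h, ctx_x = rand_matrix(d_ctx, d_ctx), rand_matrix(d_ctx, d_enc)

def encode_utterance(token_ids):
    """Encoder RNN: the utterance vector is the state after the last token."""
    h = [0.0] * d_enc
    for t in token_ids:
        h = tanh_step(enc_h, enc_x, h, E[t])
    return h

def dialogue_context(utterances):
    """Context RNN: one update per utterance vector, not per token."""
    c = [0.0] * d_ctx
    for u in utterances:
        c = tanh_step(ctx_h, ctx_x, c, encode_utterance(u))
    return c

# Two hypothetical utterances given as token ids; a decoder RNN would be
# conditioned on `context` to generate the third utterance.
context = dialogue_context([[1, 2, 6], [3, 4, 5, 6]])
```

Note that the context state is updated only twice for the seven input tokens, which is the reduction in computational steps between utterances mentioned above.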













Bidirectional HRED 

In HRED, the utterance representation is given by the last
hidden state of the encoder RNN. This architecture worked
well for web queries, but may be insufficient for dialogue
utterances, which are longer and contain more syntactic
articulations than web queries. For long utterances, the last
state of the encoder RNN may not reflect important
information seen at the beginning of the utterance. Thus, we
also experiment with a model where the utterance encoder
is a bidirectional RNN. Bidirectional RNNs run two chains:
one forward through the utterance tokens and another
backward, i.e. reversing the tokens in the utterance. The forward
hidden state at position $n$ summarizes tokens preceding
position $n$ and the backward hidden state summarizes tokens
following position $n$.¹ To obtain a fixed-length
representation for the utterance, we summarize the information in
the forward and backward RNN hidden states by either: 1)
taking the concatenation of the last state of each RNN as
input to the context RNN, or 2) applying L2 pooling over
the temporal dimension of each chain, and taking the
concatenation of the two pooled states as input to the context
RNN.² The bidirectional structure will effectively introduce
additional short term dependencies, which has proven useful
in similar architectures (Bahdanau, Cho, and Bengio 2015;
Sutskever, Vinyals, and Le 2014). In experiments below, we
refer to this variant as HRED-Bidirectional.
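
The two ways of summarizing the bidirectional states can be sketched as follows. The sketch assumes that L2 pooling means the element-wise root of the mean squared hidden state over positions, and that each list of states is given in the order in which its chain processed the tokens; the function names and the toy states are hypothetical.

```python
import math

def l2_pool(states):
    """Element-wise sqrt of the mean squared hidden state over positions."""
    n = len(states)
    dim = len(states[0])
    return [math.sqrt(sum(h[i] ** 2 for h in states) / n) for i in range(dim)]

def utterance_vector(forward_states, backward_states, mode="l2"):
    if mode == "last":
        # Option 1: concatenate the final state of each chain.
        return forward_states[-1] + backward_states[-1]
    # Option 2: pool each chain over time, then concatenate.
    return l2_pool(forward_states) + l2_pool(backward_states)

fwd = [[3.0, 0.0], [4.0, 0.0]]   # hypothetical forward states, two positions
bwd = [[0.0, 1.0], [0.0, 1.0]]   # hypothetical backward states
pooled = utterance_vector(fwd, bwd, mode="l2")
last = utterance_vector(fwd, bwd, mode="last")
```

Either way the result has twice the encoder dimension, so the input size of the context RNN has to be adjusted accordingly.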

Bootstrapping from Word Embeddings and 
Subtitles Q-A 

The commonsense knowledge that the dialogue interlocutors
share may be difficult to infer if the dataset is not
sufficiently large. Therefore, our models may be improved
by learning word embeddings from larger corpora.
We choose to initialize our word embeddings $E$ with
Word2Vec³ (Mikolov et al. 2013) trained on the Google
News dataset containing about 100 billion words. The sheer
size of the dataset ensures that the embeddings contain rich
semantic information about each word.

To learn a good initialization point for all model parameters,
instead of only the word embeddings, we can further
pretrain the model on a large non-dialogue corpus, which
covers similar topics and types of interactions between
interlocutors. One such corpus is the Q-A SubTle corpus
containing about 5.5M Q-A pairs constructed from movie
subtitles (Ameixa et al. 2014). We construct an artificial dialogue
dataset by taking each $\{Q, A\}$ pair as a two-turn dialogue
$D = \{U_1 = Q, U_2 = A\}$ and use this to pretrain the model.

¹ The output of the bidirectional RNN is always based on the
utterance before the current utterance of the decoder RNN.

² L2 pooling over an utterance $U_m$ is defined as
$\sqrt{\frac{1}{N_m} \sum_{n=1}^{N_m} h_n^2}$, where $h_n$ is the encoder RNN hidden
state at position $n$, and $N_m$ is the length of the utterance.

³ http://code.google.com/p/word2vec/

Dataset

The MovieTriples dataset has been developed by expanding
and preprocessing the Movie-DiC dataset by Banchs et
al. (2012) to make it fit the generative dialogue modeling
framework. The dataset is available upon request. Movie
scripts span a wide range of topics, contain long interactions
with few participants and relatively few spelling mistakes
and acronyms. As observed by Forchini (2009): “movie
language can be regarded as a potential source for teaching
and learning spoken language features”. Therefore, we
believe bootstrapping a goal-driven spoken dialogue system
based on movie scripts will improve performance.

                      Training    Validation    Test
Movies                484         65            65
Triples               196,308     24,717        24,271
Avg. tokens/triple    53          53            55
Avg. unk/triple       0.97        1.22          1.19

Table 1: Statistics of the MovieTriples dataset.

We used the Python-based natural language toolkit NLTK
(Bird, Klein, and Loper 2009) to perform tokenization and
named-entity recognition. All names and numbers were
replaced with the <person> and <number> tokens
respectively (Ritter, Cherry, and Dolan 2010). To reduce data
sparsity further, all tokens were transformed to lowercase letters,
and all but the 10,000 most frequent tokens were replaced
with a generic <unk> token.
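
The lowercasing and vocabulary truncation steps can be sketched with the standard library alone. Tokenization and named-entity replacement (done with NLTK in the paper) are not shown; the function names `build_vocab` and `normalize` and the toy corpus are hypothetical, and the vocabulary size is set to 4 instead of 10,000 so that the effect is visible.

```python
from collections import Counter

def build_vocab(tokenized_utterances, max_size):
    """Keep the `max_size` most frequent lowercased tokens."""
    counts = Counter(tok.lower() for utt in tokenized_utterances for tok in utt)
    return {tok for tok, _ in counts.most_common(max_size)}

def normalize(utterance, vocab, unk="<unk>"):
    """Lowercase each token and map out-of-vocabulary tokens to <unk>."""
    return [tok.lower() if tok.lower() in vocab else unk for tok in utterance]

corpus = [["What", "are", "you", "doing", "?"],
          ["What", "are", "you", "saying", "?"]]
vocab = build_vocab(corpus, max_size=4)
cleaned = [normalize(u, vocab) for u in corpus]
```

In this toy corpus the two verbs occur only once each, so both fall outside the four-token vocabulary and are replaced by `<unk>`.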

We then generated “triples” $\{U_1, U_2, U_3\}$, i.e. dialogues
of three turns between two interlocutors A and B, for
which A emits a first utterance $U_1$, B responds with $U_2$
and A responds with a last utterance $U_3$ (Sordoni et al.
2015b). To capture the interactive dialogue structure, a
special end-of-utterance token is appended to all utterances and
a continued-utterance token between breaks in lines from
the same speaker. To avoid co-dependencies between triples
coming from the same movie, we first split the movies
into training, validation and test set, and then construct the
triples. Statistics are reported in Table 1.
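
One possible reading of the triple construction is sketched below. Several details are assumptions rather than the authors' procedure: the symbol `</s>` for the end-of-utterance token, the exact spelling of the continued-utterance token, and the use of a sliding window over consecutive turns. The functions `merge_turns` and `make_triples` are hypothetical. In practice this would be applied separately to the movies of each split, after the split has been made.

```python
EOU = "</s>"                        # assumed end-of-utterance symbol
CONT = "<continued_utterance>"      # assumed continued-utterance symbol

def merge_turns(lines):
    """Join consecutive lines by the same speaker into one turn."""
    turns = []
    for speaker, tokens in lines:
        if turns and turns[-1][0] == speaker:
            turns[-1][1].extend([CONT] + tokens)
        else:
            turns.append((speaker, list(tokens)))
    return [(s, toks + [EOU]) for s, toks in turns]

def make_triples(lines):
    """Every window of three consecutive turns with speaker pattern A, B, A."""
    turns = merge_turns(lines)
    triples = []
    for i in range(len(turns) - 2):
        (a, u1), (b, u2), (c, u3) = turns[i:i + 3]
        if a == c and a != b:
            triples.append((u1, u2, u3))
    return triples

script = [("A", ["oh", "."]), ("A", ["oh", "."]),
          ("B", ["what", "'s", "the", "matter", "?"]),
          ("A", ["i", "don't", "know", "."])]
triples = make_triples(script)
```

The first two lines of the toy script come from the same speaker, so they are merged into a single utterance with the continued-utterance token between them.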

Experiments 

We evaluate the different variants of our HRED model, and 
compare against several alternatives, including basic n-gram 
models (Goodman 2001), a standard (non-hierarchical) 
RNN trained on the concatenation of the utterances in each 
triple, and a context-sensitive model (DCGM-I) recently 
proposed by Sordoni et al. (2015b). 

Evaluation Metrics 

Accurate evaluation of a non-goal-driven dialogue system
is an open problem (Galley et al. 2015; Pietquin and
Hastie 2013; Schatzmann, Georgila, and Young 2005).
There is no well-established method for automatic
evaluation, and human-based evaluation is expensive.
Nevertheless, for probabilistic language models word perplexity is
a well-established performance metric (Bengio et al. 2003;
Mikolov et al. 2010), and has been suggested for generative
dialogue models previously (Pietquin and Hastie 2013). We
define word perplexity:

$$\exp\left(-\frac{1}{N_W} \sum_{n=1}^{N} \log P_\theta(U_1^n, U_2^n, U_3^n)\right), \tag{5}$$

for a model with parameters $\theta$, a dataset with $N$ triples
$\{U_1^n, U_2^n, U_3^n\}_{n=1}^{N}$, and $N_W$ the number of tokens in the
entire dataset. Lower perplexity is indicative of a better model.
Perplexity explicitly measures the model’s ability to account
for the syntactic structure of the dialogue (e.g. turn-taking)
and the syntactic structure of each utterance (e.g.
punctuation marks). In dialogue, the distribution over the words in
the next utterance is highly multi-modal, e.g. there are many
possible answers, which makes perplexity particularly
appropriate because it will always measure the probability of
regenerating the exact reference utterance.
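
Word perplexity can be computed directly from per-triple log-probabilities, as in the following sketch. The function name `word_perplexity` and the toy numbers are hypothetical; the sanity check uses the fact that a model assigning probability 1/50 to every token must have perplexity 50.

```python
import math

def word_perplexity(triple_log_probs, num_tokens):
    """exp(-(1/N_W) * sum over triples of log P(U_1, U_2, U_3))."""
    return math.exp(-sum(triple_log_probs) / num_tokens)

# Hypothetical check: a uniform model over 50 tokens assigns log(1/50)
# to every token, so its perplexity should equal the vocabulary size.
tokens_per_triple = [12, 7, 20]
log_probs = [n * math.log(1.0 / 50) for n in tokens_per_triple]
ppl = word_perplexity(log_probs, sum(tokens_per_triple))
```

Note that the normalizer is the total token count over the dataset, not the number of triples.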

We also consider the word classification error (also known
as word error-rate). This is defined as the number of words
in the dataset the model has predicted incorrectly divided
by the total number of words in the dataset.⁴ Each word
contributes either zero or $1/N_W$ to the count, which means
that it is more robust to unlikely (e.g. unpredictable) words.
However, it is also less fine-grained than word perplexity.
Instead of measuring the whole distribution, it only
measures the regions of high probability.
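
A minimal sketch of the word classification error follows. It assumes that the model's prediction at each position is its most probable token given the true preceding tokens, which the text does not state explicitly; `word_error_rate` and the toy sequences are hypothetical.

```python
def word_error_rate(predicted, reference):
    """Fraction of positions where the predicted word differs from the reference."""
    assert len(predicted) == len(reference)
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return wrong / len(reference)

reference = ["i", "don't", "know", ".", "</s>"]
predicted = ["i", "am", "know", "!", "</s>"]    # hypothetical argmax predictions
rate = word_error_rate(predicted, reference)
```

Two of the five positions are wrong here, and each contributes exactly one fifth regardless of how improbable the reference word was under the model.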

Ultimately, we care about generating syntactically and
semantically coherent dialogues: for example, utterances
which are grammatically correct and reflect the distribution
of topics in the corpus, and whole dialogues which reflect
the interaction patterns and topical evolutions of the
dialogues in the corpus. Despite having been proposed
before (Pietquin and Hastie 2013; Schatzmann, Georgila, and
Young 2005), it is not clear how well word perplexity and
word classification errors correlate with this goal.
Nevertheless, optimizing probabilistic models using word
perplexity has shown promising results in several machine
learning tasks including statistical machine translation (Auli et
al. 2013; Sutskever, Vinyals, and Le 2014; Bahdanau, Cho,
and Bengio 2015), speech recognition (Hinton et al. 2012;
Deng and Li 2013) and image caption generation (Kiros,
Salakhutdinov, and Zemel 2014; Vinyals et al. 2015). Based
on these empirical findings, we expect to be able to
discriminate between models based on word perplexity, and to use
word classification error and qualitative analysis of
generated dialogues to understand the performance of the models
in depth.

⁴ For a word prediction to be counted as correct, both the word
and its position in the utterance must be correct.

Training Procedure

To train the neural network models, we optimized the
log-likelihood of the triples using the recently proposed Adam
optimizer (Kingma and Ba 2014).⁵ Our implementation
relied on the open-source Python library Theano (Bastien et
al. 2012).⁶ The best hyperparameters of the models were
chosen by early stopping with patience on the validation set
perplexity (Bengio 2012). We initialized the recurrent
parameter matrices as orthogonal matrices, and all other
parameters from a Gaussian random distribution with mean
zero and standard deviation 0.01. For the baseline RNN,
we tested hidden state spaces $d_h$ = 200, 300 and 400. For
HRED we experimented with encoder and decoder hidden
state spaces of size 200, 300 and 400. Based on preliminary
results and due to GPU memory limitations, we limited
ourselves to size 300 when not bootstrapping or
bootstrapping from Word2Vec, and to size 400 when bootstrapping
from SubTle. Preliminary experiments showed that the
context RNN state space at and above 300 performed similarly,
so we fixed it at 300 when not bootstrapping or
bootstrapping from Word2Vec, and at 1200 when bootstrapping from
SubTle. For all models, we used word embeddings of size
400 when bootstrapping from SubTle and of size 300
otherwise. To help generalization, we used the maxout activation
function, between the hidden state and the projected word
embeddings of the decoder RNN, when not bootstrapping
and when bootstrapping from Word2Vec. We used L2
pooling for all HRED models, except when bootstrapping from
SubTle since it appeared to perform slightly worse.

⁵ We truncated all triples to have a maximum size of 80 tokens,
and could therefore apply backpropagation on the full token
sequences.
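
Early stopping with patience on the validation perplexity can be sketched as below. This is the simple fixed-patience variant and not necessarily the exact rule used in the experiments; the function `early_stopping`, the patience value and the validation curve are hypothetical.

```python
def early_stopping(validation_perplexities, patience):
    """Return (best_epoch, best_value), stopping once `patience` epochs
    have passed without a new best validation perplexity."""
    best_epoch, best_value = None, float("inf")
    for epoch, value in enumerate(validation_perplexities):
        if value < best_value:
            best_epoch, best_value = epoch, value
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_value

history = [60.0, 45.0, 38.0, 36.5, 36.9, 37.4, 36.0]   # hypothetical curve
best = early_stopping(history, patience=2)
```

With a patience of 2 the scan stops at epoch 5 and never sees the later improvement at epoch 6, which illustrates the trade-off involved in choosing the patience.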

Bootstrapping Word Embeddings. Our embedding matrix
$E$ was initialized using the publicly available 300-dimensional
Word2Vec embeddings trained on the Google News
corpus (Mikolov et al. 2013). Special dialogue tokens, which
did not exist in the Word2Vec embeddings, were initialized
from a Gaussian random distribution as before. The training
procedure was carried out in two stages. First, we trained
each neural model with fixed Word2Vec embeddings.
During this stage, we also trained the speech act and placeholder
tokens, together with tokens not covered by the original
Word2Vec embeddings. In the second stage, we then trained
all parameters of each neural model until convergence.

Bootstrapping SubTle. We processed the SubTle corpus
by following the same procedure as used for MovieTriples,
but treating the last utterances as empty. The final SubTle
corpus contained 5,503,741 Q-A pairs, and a total of
93,320,500 tokens. When bootstrapping from the SubTle
corpus, we found that all models performed slightly better
when randomly initializing and learning the word
embeddings from SubTle compared to fixing the word embeddings
to those given by Word2Vec. For this reason, we do not
report results combining bootstrapping from the SubTle
corpus with Word2Vec word embeddings.

The HRED models were pretrained for approximately 
four epochs on the SubTle dataset, after which performance 
did not appear to improve further. Then, we fine-tuned the 
pretrained models on the MovieTriples dataset holding the 
word embeddings fixed. 


⁶ The model implementations can be found on GitHub:
https://github.com/julianser/hed-dlg
https://github.com/julianser/rnn-lm




Model                             Perplexity      Perplexity @ U_3    Error-Rate        Error-Rate @ U_3
Backoff N-Gram                    64.89           65.05               -                 -
Modified Kneser-Ney               60.11           54.75               -                 -
Absolute Discounting N-Gram       56.98           57.06               -                 -
Witten-Bell Discounting N-Gram    53.30           53.34               -                 -
RNN                               35.63 ± 0.16    35.30 ± 0.22        66.34% ± 0.06     66.32% ± 0.08
DCGM-I                            36.10 ± 0.17    36.14 ± 0.26        66.44% ± 0.06     66.57% ± 0.10
HRED                              36.59 ± 0.19    36.26 ± 0.29        66.32% ± 0.06     66.32% ± 0.11
HRED + Word2Vec                   33.95 ± 0.16    33.62 ± 0.25        66.06% ± 0.06     66.05% ± 0.09
RNN + SubTle                      27.09 ± 0.13    26.67 ± 0.19        64.10% ± 0.06     64.07% ± 0.10
HRED + SubTle                     27.14 ± 0.12    26.60 ± 0.19        64.10% ± 0.06     64.03% ± 0.10
HRED-Bi. + SubTle                 26.81 ± 0.11    26.31 ± 0.19        63.93% ± 0.06     63.91% ± 0.09

Table 2: Test set results computed on $\{U_1, U_2, U_3\}$ and solely on $\{U_3\}$ conditioned on $\{U_1, U_2\}$. Standard deviations are shown
for all neural models. The best performance in each column is that of HRED-Bi. + SubTle.


Reference (U_1, U_2)                             MAP                           Target (U_3)

U_1: yeah , okay .                               i ’ ll see you tomorrow .     yeah .
U_2: well , i guess i ’ ll be going now .

U_1: oh . <continued_utterance> oh .             i don ’ t know .              oh .
U_2: what ’ s the matter , honey ?

U_1: it ’ s the cheapest .                       no , it ’ s not .             they ’ re all good , sir .
U_2: then it ’ s the worst kind ?

U_1: <person> ! what are you doing ?             what are you doing here ?     what are you that crazy ?
U_2: shut up ! c ’ mon .

Table 3: MAP outputs for HRED-Bidirectional bootstrapped from SubTle corpus. The first column shows the reference
utterances, where $U_1$ and $U_2$ are respectively the first and second utterance in the test triple. The second column shows the MAP
output produced by beam-search conditioned on $U_1$ and $U_2$. The third column shows the actual third utterance in the test triple.


Empirical Results 

Our results are presented in Table 2. All neural models
beat state-of-the-art n-gram models substantially w.r.t. both
word perplexity and word classification error. Without
bootstrapping, the RNN model performs similarly to the more
complex DCGM-I and HRED models. This can be
explained by the size of the dataset, which makes it easy for
the HRED and DCGM-I models to overfit. The last four
lines of Table 2 confirm that bootstrapping the model
parameters achieves significant gains in both measures.
Bootstrapping from SubTle is particularly useful since it allows a
gain of nearly 10 perplexity points compared to the HRED
model without bootstrapping. We believe that this is
because it trains all model parameters, unlike bootstrapping
from Word2Vec, which only trains the word embeddings.

In general, we find that the gains due to architectural
choice are smaller than those obtained by bootstrapping.
This can be explained by the fact that we are in a regime
of relatively little training data compared to other natural
language processing tasks, such as machine translation;
hence we would expect the differences to grow with more
training data and longer dialogues. Overall, the bidirectional
structure appears to capture and retain information from the
$U_1$ and $U_2$ utterances better than either the RNN or the
original HRED model. This confirms our earlier hypothesis,
and demonstrates the potential of HRED for modeling long
dialogues.


MAP Outputs 

We also considered the use of beam-search for
RNNs (Graves 2012) to approximate the most probable
(MAP) last utterance, $U_3$, given the first two
utterances, $U_1$ and $U_2$. MAP outputs are presented in
Table 3 for HRED-Bidirectional bootstrapped from the SubTle
corpus. As shown here, the model often produced
sensible answers. However, the majority of
the predictions were generic, such as I don’t know or
I’m sorry. We observed the same phenomenon for the
RNN model, and similar observations can be inferred
from remarks in the recent literature (Sordoni et al. 2015b;
Vinyals and Le 2015). To the best of our knowledge, we are
the first to emphasize and discuss it in detail.⁷

⁷ After publishing the first draft of this paper, Li et al. (2015)
investigated this problem further and proposed to change the
objective function at test time to also maximize the mutual information
between the generated utterance and the previous utterances.

There are several possible explanations for this behavior.
Firstly, due to data scarcity, the models may only have
learned to predict the most frequent utterances. Since the
dialogues are inherently ambiguous and multi-modal,
predicting them accurately would require more data than other
natural language processing tasks. Secondly, the majority
of tokens were punctuation marks and pronouns. Since
every token is weighted equally during training, the gradient
signal of the neural networks is dominated by these
punctuation and pronoun tokens. This makes it hard for the neural
networks to learn topic-specific embeddings and even harder
to predict diverse utterances. This suggests exploring
neural architectures which explicitly separate semantic structure
from syntactic structure. Finally, the context of a triple may
be too short. In that case, the models should benefit from
longer contexts and from conditioning on other information
sources, such as semantic and visual information.

An important implication of this observation is that
metrics based on MAP outputs (e.g. cosine similarity, BLEU,
Levenshtein distance) will primarily favor models that
output the same number of punctuation marks and pronouns
as are in the test utterances, as opposed to similar semantic
content (e.g. nouns and verbs). This would be systematically
biased and not necessarily in any way correlate with
the objective of producing appropriate responses. We
therefore cannot justify the use of such metrics when the results
are known to lack diversity.

Nevertheless, we also note that this problem did not occur 
when we generated stochastic samples (as opposed to the 
MAP outputs). In fact, the stochastic samples contained a 
large variety of topic-specific words and often appeared to 
maintain the topic of the conversation. 
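
The contrast between MAP decoding and stochastic sampling can be illustrated on a toy model. Everything in the sketch is hypothetical: the next-token table `MODEL` conditions only on the previous token (unlike the models in this paper), and its probabilities are invented so that the single most probable sequence is a generic reply while other replies still carry substantial probability mass.

```python
import math
import random

# Hypothetical next-token table keyed by the previous token only.
MODEL = {
    "<s>":   {"i": 0.6, "they": 0.4},
    "i":     {"don't": 0.7, "see": 0.3},
    "don't": {"know": 0.9, "care": 0.1},
    "they":  {"know": 0.5, "care": 0.5},
    "see":   {"</s>": 1.0},
    "know":  {"</s>": 1.0},
    "care":  {"</s>": 1.0},
}

def beam_search(beam_width, max_len=6):
    """Approximate the most probable sequence by keeping the best partial ones."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, p in MODEL[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            (finished if cand[1][-1] == "</s>" else beams).append(cand)
        if not beams:
            break
    return max(finished, key=lambda c: c[0])[1]

def sample(rng, max_len=6):
    """Draw one sequence token by token from the conditional distributions."""
    seq = ["<s>"]
    while seq[-1] != "</s>" and len(seq) < max_len:
        toks, probs = zip(*MODEL[seq[-1]].items())
        seq.append(rng.choices(toks, weights=probs)[0])
    return seq

map_output = beam_search(beam_width=2)
samples = {tuple(sample(random.Random(seed))) for seed in range(20)}
```

In this toy setting the beam search always returns the same high-probability reply, whereas repeated sampling can also reach the lower-probability continuations.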

Conclusion and Future Work 

We have demonstrated that a hierarchical recurrent neural
network generative model can outperform both n-gram
based models and baseline neural network models on the
task of modeling utterances and speech acts. To support
our investigation, we introduced a novel dataset called
MovieTriples based on movie scripts, which are suitable for
modeling long, open domain dialogues close to human
spoken language. In addition to the recurrent hierarchical
architecture, we found two crucial ingredients for improving
performance: the use of a large external monologue corpus
to initialize the word embeddings, and the use of a large
related, but non-dialogue, corpus in order to pretrain the
recurrent net. This points to the need for larger dialogue datasets.

Empirical performance of all models was measured
using perplexity. While this is an established measure for
generative models, in the dialogue setting utterances may
be overwhelmed by many common words especially arising
from colloquial or informal exchanges. It may be fruitful
to investigate other measures of performance for
generative dialogue systems. We also considered actual responses
produced by the model. The MAP outputs tended to
produce somewhat generic, but conversationally acceptable,
responses. Stochastic samples from the model produced more
diverse dialogues.

Future work should study models for full length
dialogues, as opposed to triples, and model other speech acts,
such as interlocutors entering or leaving the dialogue and
executing actions. Finally, our analysis of the model MAP
outputs suggests that it would be beneficial to include longer
and additional context, including other modalities such as
audio and video.